Exploration of Different Imputation Methods for Missing Data

Karthik Aerra, Liz Miller, Mohit Kumar Veeraboina, Robert Stairs

2024-07-27

Introduction


Methods


Analysis and Results

Introduction to the Dataset

  • “Ozone: Los Angeles Ozone Pollution Data, 1976” from mlbench package in R (“Ozone”)
  • It contains observations related to pollution levels in the Los Angeles area during 1976.
  • It contains 13 variables and 366 observations (one for each day of the year)
  • We chose this dataset because of its high volume of already-missing data
  • It demonstrates a real-life scenario in which the missing values are truly unknown and the effectiveness of imputation methods cannot be measured directly
  • It is up to the investigator to choose a methodology for handling the missing data and appropriate metrics for evaluating its effectiveness
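As a starting point, the dataset can be loaded directly from mlbench (assuming the package is installed):

```r
# Load the Ozone dataset from the mlbench package
library(mlbench)
data(Ozone)

dim(Ozone)  # 366 observations of 13 variables
str(Ozone)
```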

Variables in the Ozone Dataset

  • V1: Month, 1-12, where 1 is January and 12 is December
  • V2: Day of month
  • V3: Day of week, 1-7, where 1 is Monday and 7 is Sunday
  • V4: Daily maximum one-hour-average ozone reading
  • V5: 500 millibar pressure height (m) measured at Vandenberg AFB
  • V6: Wind speed (mph) at Los Angeles International Airport (LAX)
  • V7: Humidity (%) at LAX
  • V8: Temperature (degrees F) measured at Sandburg, CA
  • V9: Temperature (degrees F) measured at El Monte, CA
  • V10: Inversion base height (feet) at LAX
  • V11: Pressure gradient (mm Hg) from LAX to Daggett, CA
  • V12: Inversion base temperature (degrees F) at LAX
  • V13: Visibility (miles) measured at LAX

Summarizing the Data

  • The summary() function shows the number of NAs for each column
  • Most columns have fewer than 20 NAs; however, V9 contains 139 missing values
  • V9: Temperature (degrees F) measured at El Monte, CA
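The output below can be reproduced with a single call on the raw dataset:

```r
# Column-wise summaries, including the NA count for each variable
summary(Ozone)
```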
       V4              V1            V2      V3           V5      
 Min.   : 1.00   1      : 31   1      : 12   1:52   Min.   :5320  
 1st Qu.: 5.00   3      : 31   2      : 12   2:52   1st Qu.:5700  
 Median : 9.00   5      : 31   3      : 12   3:52   Median :5770  
 Mean   :11.53   7      : 31   4      : 12   4:53   Mean   :5753  
 3rd Qu.:16.00   8      : 31   5      : 12   5:53   3rd Qu.:5830  
 Max.   :38.00   10     : 31   6      : 12   6:52   Max.   :5950  
 NA's   :5       (Other):180   (Other):294   7:52   NA's   :12    
       V6               V7              V8              V9       
 Min.   : 0.000   Min.   :19.00   Min.   :25.00   Min.   :27.68  
 1st Qu.: 3.000   1st Qu.:49.00   1st Qu.:51.00   1st Qu.:49.73  
 Median : 5.000   Median :65.00   Median :62.00   Median :57.02  
 Mean   : 4.869   Mean   :58.48   Mean   :61.91   Mean   :56.85  
 3rd Qu.: 6.000   3rd Qu.:73.00   3rd Qu.:72.00   3rd Qu.:66.11  
 Max.   :11.000   Max.   :93.00   Max.   :93.00   Max.   :82.58  
                  NA's   :15      NA's   :2       NA's   :139    
      V10            V11             V12             V13       
 Min.   : 111   Min.   :-69.0   Min.   :27.50   Min.   :  0.0  
 1st Qu.: 890   1st Qu.:-10.0   1st Qu.:51.26   1st Qu.: 70.0  
 Median :2125   Median : 24.0   Median :62.24   Median :110.0  
 Mean   :2591   Mean   : 17.8   Mean   :60.93   Mean   :123.3  
 3rd Qu.:5000   3rd Qu.: 45.0   3rd Qu.:70.52   3rd Qu.:150.0  
 Max.   :5000   Max.   :107.0   Max.   :91.76   Max.   :500.0  
 NA's   :15     NA's   :1       NA's   :14                     

Data Pre-Processing Summary

  • All columns converted to numeric data type
  • Renamed columns for clearer interpretation during analysis and reporting
  • Combined month, day of month and day of week into one data column
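The exact pre-processing code is not shown in the report; the following is a minimal sketch of what it might look like, with column names chosen as assumptions to match the output shown later:

```r
# Sketch: convert all columns to numeric and rename for readability.
# The names below are assumptions chosen to match later output.
ozone2 <- as.data.frame(lapply(Ozone, function(x) as.numeric(as.character(x))))
names(ozone2) <- c("Month", "Day_of_month", "Day_of_week", "Ozone_reading",
                   "Pressure_afb", "Wind_speed_LAX", "Humidity_LAX",
                   "Temp_sandburg", "Temp_EM", "IBH_LAX",
                   "Pressure_gradient", "IBT_LAX", "Visibility_LAX")

# Move the response (Ozone_reading) to the first column, matching later output
ozone2 <- ozone2[, c(4, 1:3, 5:13)]
```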

Data Exploration Summary

  • Summary statistics by day of week, month
  • Histograms to visualize distributions of data
  • Correlation coefficients for all features with respect to Ozone levels (output)
  • Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
  • Strong Negative Correlations: IBH_LAX and Visibility_LAX
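The correlation step can be sketched as follows, using complete observations only since the data contain NAs:

```r
# Correlation of each feature with the ozone reading, ignoring missing pairs
cors <- sapply(ozone2[, -1],
               function(x) cor(x, ozone2$Ozone_reading, use = "complete.obs"))
sort(cors, decreasing = TRUE)
```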

Exploration of the Missing Data

  • The total number of missing data points for each column is shown below as a count and as a percentage.
  • Most of the columns contain <5% missing values. Temp_EM contains 139 missing values, which is 38.0% of the observations.
  • The total number of missing values in the dataset is 203 out of 4,758 (13 columns times 366 observations). This represents about 4.3% of the entire dataset.
  • There are a few instances where more than one column has missing data, but the majority of rows with missing values are only missing Temp_EM
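The counts and percentages above can be computed along these lines (a sketch):

```r
# Number and percentage of missing values per column
na_count <- colSums(is.na(ozone2))
na_pct   <- round(100 * na_count / nrow(ozone2), 1)
data.frame(na_count, na_pct)

# Total missing values across the entire dataset
sum(is.na(ozone2))  # 203
```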

Statistical Testing for MCAR (Missing Completely at Random)

Randomness Testing of Missing Data

Little’s MCAR test was performed to assess whether the missingness is random.

  • Null Hypothesis (H0): The data are missing completely at random (MCAR).
  • Alternative Hypothesis (H1): The data are not missing completely at random.

Based on the p-value from the test (4.102052e-12), the null hypothesis is rejected: the data are not missing completely at random (MCAR).

# Perform Little's MCAR test
mcar_result <- mcar_test(ozone2)
print(mcar_result)
# A tibble: 1 × 4
  statistic    df  p.value missing.patterns
      <dbl> <dbl>    <dbl>            <int>
1      278.   134 4.10e-12               13

Additionally, the Hawkins and non-parametric tests were performed to further analyze the randomness of the missing data.

Test Data for Normality

To interpret these tests, the data must first be checked for normality. The Shapiro-Wilk and Anderson-Darling tests were performed.


    Shapiro-Wilk normality test

data:  ozone_levels
W = 0.91044, p-value = 8.37e-14

    Anderson-Darling normality test

data:  ozone_levels
A = 10.027, p-value < 2.2e-16

A histogram and QQ plot were also created to inspect the distribution visually.
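The normality tests and plots can be generated along these lines (ad.test comes from the nortest package; this is a sketch):

```r
library(nortest)  # for the Anderson-Darling test

ozone_levels <- na.omit(ozone2$Ozone_reading)

shapiro.test(ozone_levels)   # Shapiro-Wilk normality test
ad.test(ozone_levels)        # Anderson-Darling normality test

# Visual checks: histogram and QQ plot
hist(ozone_levels, main = "Histogram of Ozone Readings", xlab = "Ozone reading")
qqnorm(ozone_levels)
qqline(ozone_levels)
```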

Run Hawkins and Non-Parametric Tests

Call:
MissMech::TestMCARNormality(data = .)

Number of Patterns:  4 

Total number of cases used in the analysis:  354 

 Pattern(s) used:
          Temp_EM   Month   Day_of_month   Day_of_week   Pressure_afb
group.1        NA       1              1             1              1
group.2         1       1              1             1              1
group.3         1       1              1             1              1
group.4         1       1              1             1             NA
          Wind_speed_LAX   Humidity_LAX   Temp_sandburg   IBH_LAX
group.1                1              1               1         1
group.2                1              1               1         1
group.3                1             NA               1        NA
group.4                1              1               1         1
          Pressure_gradient   IBT_LAX   Visibility_LAX   Number of cases
group.1                   1         1                1               129
group.2                   1         1                1               206
group.3                   1        NA                1                 9
group.4                   1         1                1                10

    Test of normality and Homoscedasticity:
  -------------------------------------------

Hawkins Test:

    P-value for the Hawkins test of normality and homoscedasticity:  0.0113103 

    Either the test of multivariate normality or homoscedasticity (or both) is rejected.
    Provided that normality can be assumed, the hypothesis of MCAR is 
    rejected at 0.05 significance level. 

Non-Parametric Test:

    P-value for the non-parametric test of homoscedasticity:  0.2743229 

    Reject Normality at 0.05 significance level.
    There is not sufficient evidence to reject MCAR at 0.05 significance level.

Hawkins Test

  • P-value for the Hawkins test of normality and homoscedasticity: 0.0113103
  • This test checks for both multivariate normality and homoscedasticity (equal variances).
  • Interpretation: Since the p-value is less than 0.05, the joint null hypothesis of multivariate normality and homoscedasticity is rejected. Either the data do not follow a multivariate normal distribution, or the variances are not equal across groups (or both).
  • MCAR Hypothesis: Because the Shapiro-Wilk and Anderson-Darling tests already showed that the data are not normally distributed, the Hawkins rejection is likely driven by non-normality rather than by a violation of MCAR. The hypothesis of MCAR is therefore not rejected at the 0.05 significance level, and the non-parametric test below is the more appropriate check.

Non-Parametric Test

  • P-value for the non-parametric test of homoscedasticity: 0.2743229
  • This test checks for homoscedasticity without assuming normality.
  • Interpretation: Since the p-value is greater than 0.05, there is not enough evidence to reject the null hypothesis of homoscedasticity. This suggests that the variances are equal across groups.
  • MCAR Hypothesis: There is not sufficient evidence to reject the hypothesis of MCAR at the 0.05 significance level.

Further Visualization of the Missing Data: Missingness Patterns

The data were further visualized using the gg_miss_fct() function, which plots the proportion of missing values in each column (y-axis) broken down by one variable at a time (x-axis). The purpose is to determine whether there are discernible patterns in the missingness of the data. This provides additional supporting information for classifying the data as MCAR, MAR, or MNAR.

Focusing on Temp_EM, since it is the column with the highest number of missing values, the data appear to be missing randomly with respect to most of the variables. Among the date columns, the pattern of missingness appears random with respect to day of the month. However, there is a high proportion of missing data on days 6 and 7 of the week (Saturdays and Sundays), indicating that Temp_EM was likely not measured on weekends throughout the year. The values for Temp_EM also appear to be missing randomly with respect to month, though there is a notably high frequency of missing data around the month of May.

For the other variables in the dataset, there does not appear to be any obvious patterns to the missing data. Most importantly, the frequency of missing data for Temp_EM does not appear to depend on the output variable (ozone reading). Therefore, we can most likely proceed to imputation with the assumption that our data are missing completely at random (MCAR).

# Create gg_miss_fct plots with adjusted themes
p1 <- gg_miss_fct(ozone2, fct = Month) + 
  ggtitle("Missing Data by Month") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p2 <- gg_miss_fct(ozone2, fct = Day_of_month) + 
  ggtitle("Missing Data by Day of Month") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p3 <- gg_miss_fct(ozone2, fct = Day_of_week) + 
  ggtitle("Missing Data by Day of Week") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p4 <- gg_miss_fct(ozone2, fct = Ozone_reading) + 
  ggtitle("Missing Data by Ozone Reading") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p5 <- gg_miss_fct(ozone2, fct = Pressure_afb) + 
  ggtitle("Missing Data by Pressure (AFB)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p6 <- gg_miss_fct(ozone2, fct = Wind_speed_LAX) + 
  ggtitle("Missing Data by Wind Speed") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p7 <- gg_miss_fct(ozone2, fct = Humidity_LAX) + 
  ggtitle("Missing Data by Humidity (LAX)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p8 <- gg_miss_fct(ozone2, fct = Temp_sandburg) + 
  ggtitle("Missing Data by Temperature (Sandburg)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p9 <- gg_miss_fct(ozone2, fct = Temp_EM) + 
  ggtitle("Missing Data by Temperature (EM)") +
  theme(plot.title = element_text(size=8))

p10 <- gg_miss_fct(ozone2, fct = IBH_LAX) + 
  ggtitle("Missing Data by IBH_LAX") +
  theme(plot.title = element_text(size=8))

p11 <- gg_miss_fct(ozone2, fct = Pressure_gradient) + 
  ggtitle("Missing Data by Pressure Gradient") +
  theme(plot.title = element_text(size=8))

p12 <- gg_miss_fct(ozone2, fct = IBT_LAX) + 
  ggtitle("Missing Data by IBT_LAX") +
  theme(plot.title = element_text(size=8))

p13 <- gg_miss_fct(ozone2, fct = Visibility_LAX) + 
  ggtitle("Missing Data by Visibility_LAX") +
  theme(plot.title = element_text(size=8))

# Arrange the plots into grids with proper spacing
grid1 <- grid.arrange(p1, p2, p3, p4, nrow = 2)

grid2 <- grid.arrange(p5, p6, p7, p8, nrow = 2)

grid3 <- grid.arrange(p9, p10, p11, nrow = 2)

grid4 <- grid.arrange(p12, p13, nrow = 1)

Missing Data Imputation Using Simple Methods

Missing data were deleted using listwise deletion or feature selection as baseline comparisons for the imputation methods. Additionally, missing data were imputed using the mean, median, or mode as a demonstration of simple imputation techniques. Listwise deletion simply deletes all rows that contain a missing value. Feature removal deletes a column if its total percentage of missing values exceeds a set threshold (we chose 20%). With mean, median, or mode imputation, all NA values in a column are replaced by a single value: the mean, median, or mode of that column. One dataframe was created for each method, resulting in a total of five dataframes.

Feature selection (column deletion)

In this case, the Temp_EM column is dropped because more than 20% of its values are missing. Remaining rows with NAs are also dropped so that machine learning models can be applied.

# Function to drop column if quantity of missing values is over the threshold
drop_na_columns <- function(data, threshold) {
  na_counts <- colSums(is.na(data))
  na_proportion <- na_counts / nrow(data)
  data <- data[, na_proportion <= threshold]
  return(data)
}
# Define threshold (e.g., 20% NA allowed)
threshold <- 0.20

# Drop columns based on the NA threshold
dropCol_data <- drop_na_columns(ozone2, threshold) # the column Temp_EM gets dropped
dropCol_data <- data.frame(dropCol_data)
dropCol_data <- na.omit(dropCol_data)

head(dropCol_data)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
7             4     1            7           3         5790              6
8             4     1            8           4         5790              3
  Humidity_LAX Temp_sandburg IBH_LAX Pressure_gradient IBT_LAX Visibility_LAX
3           28            40    2693               -25   47.66            250
4           37            45     590               -24   55.04            100
5           51            54    1450                25   57.02             60
6           69            35    1568                15   53.78             60
7           19            45    2631               -33   54.14            100
8           25            55     554               -28   64.76            250
print(sum(is.na(dropCol_data)))
[1] 0

Listwise deletion (row deletion)

Create a new dataset where all rows with NAs are deleted.

# Drop all missing values. Row deletion.
dropNA_data <- na.omit(ozone2)

head(dropNA_data)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
5              5     1            5           1         5760              3
6              6     1            6           2         5720              4
7              4     1            7           3         5790              6
8              4     1            8           4         5790              3
9              6     1            9           5         5700              3
12             6     1           12           1         5720              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
5            51            54   45.32    1450                25   57.02
6            69            35   49.64    1568                15   53.78
7            19            45   46.40    2631               -33   54.14
8            25            55   52.70     554               -28   64.76
9            73            41   48.02    2083                23   52.52
12           44            51   54.32     111                 9   63.14
   Visibility_LAX
5              60
6              60
7             100
8             250
9             120
12            150
print(sum(is.na(dropNA_data)))
[1] 0

Mean imputation

Create a new dataset where all missing values are replaced with the mean of the column.

# Function for mean imputation
mean_impute <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}

imputed_data_mean <- apply(ozone2, 2, mean_impute)
imputed_data_mean <- data.frame(imputed_data_mean)

head(imputed_data_mean)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1             3     1            1           4         5480              8
2             3     1            2           5         5660              6
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
  Humidity_LAX Temp_sandburg  Temp_EM  IBH_LAX Pressure_gradient  IBT_LAX
1     20.00000      61.91484 56.84634 5000.000               -15 30.56000
2     58.47578      38.00000 56.84634 2590.943               -14 60.92733
3     28.00000      40.00000 56.84634 2693.000               -25 47.66000
4     37.00000      45.00000 56.84634  590.000               -24 55.04000
5     51.00000      54.00000 45.32000 1450.000                25 57.02000
6     69.00000      35.00000 49.64000 1568.000                15 53.78000
  Visibility_LAX
1            200
2            300
3            250
4            100
5             60
6             60
print(sum(is.na(imputed_data_mean)))
[1] 0

Median imputation

Create a new dataset where all missing values are replaced with the median of the column.

# Function for median imputation
median_impute <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
}
imputed_data_median <- apply(ozone2, 2, median_impute)
imputed_data_median <- data.frame(imputed_data_median)

head(imputed_data_median)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1             3     1            1           4         5480              8
2             3     1            2           5         5660              6
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
  Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1           20            62   57.02    5000               -15   30.56
2           65            38   57.02    2125               -14   62.24
3           28            40   57.02    2693               -25   47.66
4           37            45   57.02     590               -24   55.04
5           51            54   45.32    1450                25   57.02
6           69            35   49.64    1568                15   53.78
  Visibility_LAX
1            200
2            300
3            250
4            100
5             60
6             60
print(sum(is.na(imputed_data_median)))
[1] 0

Mode imputation

Create a new dataset where all missing values are replaced with the mode of the column.

# Function for mode imputation (using the most common value)
mode_impute <- function(x) {
  mode_val <- as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
  x[is.na(x)] <- mode_val
  return(x)
}
imputed_data_mode <- apply(ozone2, 2, mode_impute)
imputed_data_mode <- data.frame(imputed_data_mode)

head(imputed_data_mode)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1             3     1            1           4         5480              8
2             3     1            2           5         5660              6
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
  Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1           20            51   54.32    5000               -15   30.56
2           19            38   54.32    5000               -14   68.72
3           28            40   54.32    2693               -25   47.66
4           37            45   54.32     590               -24   55.04
5           51            54   45.32    1450                25   57.02
6           69            35   49.64    1568                15   53.78
  Visibility_LAX
1            200
2            300
3            250
4            100
5             60
6             60
print(sum(is.na(imputed_data_mode)))
[1] 0

Missing Data Imputation Using MICE (Multiple Imputation Method) and missForest (Machine Learning Method)

Next, the missing data were imputed using the MICE and missForest methods.

missForest

library(missForest)

# verbose = TRUE supplies additional output between iterations
# xtrue (the complete data matrix) is omitted here because the true values of
# the missing entries are unknown, so the true imputation error is unavailable
mf_result <- missForest(ozone2, verbose = TRUE)

# Extract the imputed data and convert back to a data frame
# (stored separately so the error estimate below remains accessible)
ozone2_mf <- as.data.frame(mf_result$ximp)
print(sum(is.na(ozone2_mf)))

## The estimated out-of-bag (OOB) imputation error:
mf_result$OOBerror

MICE

library(mice)

# Impute missing values using MICE
# pmm = Predictive Mean Matching (suitable for continuous variables like temperature, wind, etc.)
# m = 5: number of imputed datasets to create.
# maxit = 50: max number of iterations
ozone2_mice <- mice(ozone2, method = "pmm", m = 5, maxit = 50)

# extract a completed dataset (by default, the first of the m imputations)
ozone2_mice <- complete(ozone2_mice)

# Convert completed data to data frame
ozone2_mice <- as.data.frame(ozone2_mice)

print(sum(is.na(ozone2_mice)))

Model-Fitting of Imputed Datasets

Make Predictions Using Data Where Missing Values Were Deleted Using Listwise Deletion (Deletion of All Rows with Missing Data)

The listwise deleted dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Load additional libraries and split the data into training and testing sets

#Load additional libraries for model fitting
library(caret) # for fitting KNN models
library(e1071)
library(rsample) # for creating validation splits
library(recipes)    # for feature engineering
library(randomForest)
library(rpart)# decision tree
library(tidymodels) 
library(class) 
library(vip) 

# Split the data into training and testing sets
set.seed(123)
data_split_dropNA <- initial_split(dropNA_data, prop = 0.75)
train_dropNA <- training(data_split_dropNA)
test_dropNA <- testing(data_split_dropNA)

# Split data into predictors and target
X <- train_dropNA[, -1]  # Features
y <- train_dropNA$Ozone_reading  # Target

# Function to calculate RMSE
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

Random Forest

# Random forest model
rf_dropNA <- randomForest(Ozone_reading ~ ., data = train_dropNA)
# make predictions
rf_dropNA_pred <- predict(rf_dropNA, newdata = test_dropNA)
# Plot variable importance
varImpPlot(rf_dropNA, main = "Variable Importance Plot for Random Forest Model")

rf_dropNA_rmse <- rmse(rf_dropNA_pred, test_dropNA$Ozone_reading)
cat("Random Forest RMSE:", rf_dropNA_rmse, "\n")
Random Forest RMSE: 4.591508 

KNN

# KNN model
knn_dropNA <- train(Ozone_reading ~ ., data = train_dropNA, method = "knn")
# make predictions
knn_dropNA_pred <- predict(knn_dropNA, newdata = test_dropNA)
# Plot variable importance (caret's model-agnostic importance for the fit)
knn_importance <- varImp(knn_dropNA)
plot(knn_importance, main = "Variable Importance for KNN")

knn_dropNA_rmse <- rmse(knn_dropNA_pred, test_dropNA$Ozone_reading)
cat("KNN RMSE:", knn_dropNA_rmse, "\n")
KNN RMSE: 6.161134 

Decision Tree

# Decision tree model
tree_dropNA <- rpart(Ozone_reading ~ ., data = train_dropNA)
# make predictions
tree_dropNA_pred <- predict(tree_dropNA, newdata = test_dropNA)
# Plot variable importance
var_importance <- varImp(tree_dropNA)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

tree_dropNA_rmse <- rmse(tree_dropNA_pred, test_dropNA$Ozone_reading)
cat("Decision Tree RMSE:", tree_dropNA_rmse, "\n")
Decision Tree RMSE: 5.799608 

Make Predictions Using Data Where Columns with >20% Missing Data Were Deleted (Temp_EM column)

The deleted column dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the column deletion method (feature selection), random forest had the best model fit, with an RMSE of 3.66.
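The code for this and the following sections repeats the split/fit/evaluate pattern shown above for the listwise-deleted data; a sketch for the feature-selected dataset:

```r
# Sketch: same split/fit/evaluate pattern, applied to dropCol_data
set.seed(123)
data_split_dropCol <- initial_split(dropCol_data, prop = 0.75)
train_dropCol <- training(data_split_dropCol)
test_dropCol  <- testing(data_split_dropCol)

rf_dropCol <- randomForest(Ozone_reading ~ ., data = train_dropCol)
rf_dropCol_pred <- predict(rf_dropCol, newdata = test_dropCol)
rf_dropCol_rmse <- rmse(rf_dropCol_pred, test_dropCol$Ozone_reading)
cat("Random Forest RMSE:", rf_dropCol_rmse, "\n")
```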

Random Forest

Random Forest RMSE: 3.659657 

KNN

KNN RMSE: 5.548309 

Decision Tree

Decision Tree RMSE: 4.266184 

Make Predictions Using Data Where Missing Values were Imputed using the Mean

The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the mean imputation method, random forest had the best model fit, with an RMSE of 4.90.

Random Forest

Random Forest RMSE: 4.901004 

KNN

KNN RMSE: 6.117349 

Decision Tree

Decision Tree RMSE: 5.213247 

Make Predictions Using Data Where Missing Values were Imputed using the Median

The median imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the median imputation method, random forest had the best model fit, with an RMSE of 5.07.

Random Forest

Random Forest RMSE: 5.073677 

KNN

KNN RMSE: 6.298989 

Decision Tree

Decision Tree RMSE: 5.249752 

Make Predictions Using Data Where Missing Values were Imputed using the Mode

The mode imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the mode imputation method, random forest had the best model fit, with an RMSE of 5.40.

Random Forest

Random Forest RMSE: 5.398429 

KNN

KNN RMSE: 6.371706 

Decision Tree

Decision Tree RMSE: 6.603588 

Make Predictions using data where missing values were imputed with complex methods:

Make Predictions Using Data Where Missing Values were Imputed using missForest Method

The missForest imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the missForest imputation method, random forest had the best model fit, with an RMSE of 4.49.

Random Forest

Random Forest RMSE: 4.493497 

KNN

KNN RMSE: 6.118703 

Decision Tree

Decision Tree RMSE: 5.537124 

Make Predictions Using Data Where Missing Values were Imputed using MICE Method

The MICE imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the MICE imputation method, random forest had the best model fit, with an RMSE of 4.37.

Random Forest

Random Forest RMSE: 4.367134 

KNN

KNN RMSE: 6.142781 

Decision Tree

Decision Tree RMSE: 5.077593 

Summarize Model Performance for Each Imputation Methodology

The RMSE for each model and imputation method combination was summarized into a data frame. The results are shown below. The best model fit (lowest RMSE) was obtained when using feature selection (column deletion) to handle the missing data and the random forest algorithm for model training. The decision tree fit to the same feature-selected dataset was the second-best combination, followed by random forest with MICE imputation.

# One model label per score, repeated so labels align with the scores vector
models <- rep(c('RandomForest', 'KNN', 'DecisionTree'), times = 7)
scores <- c(rf_dropCol_rmse, knn_dropCol_rmse, tree_dropCol_rmse,
            rf_dropNA_rmse, knn_dropNA_rmse, tree_dropNA_rmse,
            rf_mean_rmse, knn_mean_rmse, tree_mean_rmse,
            rf_med_rmse, knn_med_rmse, tree_med_rmse,
            rf_mode_rmse, knn_mode_rmse, tree_mode_rmse,
            rf_miss_rmse, knn_miss_rmse, tree_miss_rmse,
            rf_mice_rmse, knn_mice_rmse, tree_mice_rmse
)
# One imputation-method label per group of three models
ImpMethod <- rep(c('DropCol', 'DropNA', 'Mean', 'Median', 'Mode',
                   'missForest', 'MICE'), each = 3)

# Create dataframe
rmse_df <- data.frame(Model = models, ImpMethod = ImpMethod, RMSE = scores)
print(rmse_df[order(rmse_df$RMSE), ])
          Model  ImpMethod     RMSE
1  RandomForest    DropCol 3.659657
3  DecisionTree    DropCol 4.266184
19 RandomForest       MICE 4.367134
16 RandomForest missForest 4.493497
4  RandomForest     DropNA 4.591508
7  RandomForest       Mean 4.901004
10 RandomForest     Median 5.073677
21 DecisionTree       MICE 5.077593
9  DecisionTree       Mean 5.213247
12 DecisionTree     Median 5.249752
13 RandomForest       Mode 5.398429
18 DecisionTree missForest 5.537124
2           KNN    DropCol 5.548309
6  DecisionTree     DropNA 5.799608
8           KNN       Mean 6.117349
17          KNN missForest 6.118703
20          KNN       MICE 6.142781
5           KNN     DropNA 6.161134
11          KNN     Median 6.298989
14          KNN       Mode 6.371706
15 DecisionTree       Mode 6.603588

Conclusions

In summary, the management of missing data is critical to maintaining the integrity and dependability of data analysis. This work demonstrated a range of methodologies for handling missing data: deletion methods (listwise deletion and feature selection), single imputation techniques (mean, median, and mode imputation), multiple imputation using MICE, and the machine learning-based missForest algorithm. The “Ozone” dataset from the mlbench package in R was used to demonstrate these techniques, along with statistical tests and visualizations that help classify missingness as MCAR, MAR, or MNAR.

For the “Ozone” dataset, deleting the Temp_EM column, which contains ~38% missing data, proved to be the most effective method for handling missing data. This was demonstrated by training machine learning models on datasets where the missing data had been handled with each of the techniques described above. A total of seven datasets were created, and each was trained and tested with KNN, decision tree, and random forest algorithms. The random forest model combined with feature selection yielded the lowest RMSE on the unseen test data. This approach may have worked best here because the data appear to be MCAR based on statistical testing.

It is best practice to test a variety of methods, as was done in this work, and to predefine success criteria when handling missing data. Ultimately, to provide reliable and accurate results, the method selected should align with the unique features of the dataset and the analytical objectives.

References